Abstract

Linear and geometric mixtures are two methods to combine arbitrary models in data 
compression. Geometric mixtures generalize the empirically well-performing PAQ7 
mixture. Both mixture schemes rely on weight vectors, which heavily determine their 
performance. Typically weight vectors are identified via Online Gradient Descent. In 
this work we show that one can obtain strong code length bounds for such a weight 
estimation scheme. These bounds hold for arbitrary input sequences. For this purpose 
we introduce the class of nice mixtures and analyze how Online Gradient Descent 
with a fixed step size combined with a nice mixture performs. These results translate 
to linear and geometric mixtures, which are nice, as we show. The results hold for 
PAQ7 mixtures as well, thus we provide the first theoretical analysis of PAQ7. 

1 Introduction 

Background. The combination of multiple probability distributions plays a key role in modern
statistical data compression algorithms, such as Prediction by Partial Matching (PPM),
Context Tree Weighting (CTW) and "Pack" (PAQ) El E]. Statistical compression 
algorithms split compression into modeling and coding and process an input sequence symbol
by symbol. During modeling a model computes a model distribution $p$, and during coding
an encoder maps the next character $x$, given $p$, to a codeword of a length close to $-\log p(x)$ bits.
Decoding is the exact reverse: given $p$ and the codeword, the decoder restores $x$. Arithmetic
Coding (AC) is the de facto standard encoder/decoder; it closely approximates the ideal code
length [3]. All of the aforementioned algorithms combine (or mix) multiple model distributions
into a single model distribution in each step. PAQ is able to mix arbitrary distributions. As 
its superior empirical performance shows, mixing arbitrary models is a promising approach. 

Previous Work. To our knowledge there exist few compression algorithms which combine 
arbitrary models. Volf's Snake- and Switching-Algorithms [10] were the first approaches to
combine just two arbitrary models. Kufleitner et al. [5] proposed Beta-Weighting, a CTW
spin-off, which mixes arbitrary models by weighting the model distributions linearly. The
weights are posterior probabilities on the models (based on a given prior distribution). Another
linear weighting scheme was introduced by Veness [8], who transferred techniques for tracking
from the online learning literature to statistical data compression. His weighting scheme 
is based on a cleverly chosen prior distribution, which enjoys good theoretical guarantees. 
Starting in 2002 Mahoney introduced PAQ and its successors [7], which attracted great 
attention among practitioners. PAQ7 and its follow-ups combine models for a binary alphabet 
via a nonlinear ad-hoc neural network and adjust the network weights by Online Gradient 
Descent (OGD) with a fixed step size [7]. Up to 2012 there was no theoretical justification 
for PAQ7-mixing. In [6] we proposed geometric (a non-linear mixing scheme) and linear 
mixtures as solutions to two weighted divergence minimization problems. Geometric mixtures 
add a sound theoretical base to PAQ7-mixing and generalize it to non-binary alphabets. 
Both mixture schemes require weights, which we estimate via OGD with a fixed step size. 



In machine learning, online parameter estimation via OGD and its analysis is well understood
[2] and has a variety of applications which closely resemble mixture-based compression.
Hence we can adopt machine learning analysis techniques for OGD in data compression
to obtain theoretical guarantees. This work draws great inspiration from Zinkevich [12], who
introduced projection-based OGD in online learning, and from Cesa-Bianchi [1] and Warmuth
[3], who analyzed OGD (without projection) in various online regression settings.

Our Contribution. In this work we establish upper bounds on the code length for linear 
and geometric mixtures coupled with OGD using a fixed step size for weight estimation. The 
bounds show that the number of bits wasted w.r.t. a desirable competing scheme (such as 
a sequence of optimal weight vectors) is small. These results directly apply to PAQ7-mixing, 
since it is a geometric mixture for a binary alphabet and typically uses OGD with a fixed 
step size for weight estimation. Thus we provide the first theoretical guarantees for PAQ. To
do so, in Section 3 we introduce the class of nice mixtures which we combine with OGD with
a fixed step size and establish code length bounds. It turns out that the choice of the step size
is of great importance. Next, in Section 4 we show that linear and geometric mixtures are nice
mixtures and apply the results of Section 3. Finally, in Section 5 we summarize our results.

2 Preliminaries 

Notation. In general, calligraphic letters denote sets, lowercase boldface letters indicate column
vectors and boldface uppercase letters name matrices. The expression $(a_i)_{1 \le i \le m}$ expands
to $(a_1\ a_2\ \dots\ a_m)^T$ where "$T$" is the transpose operator; the $i$-th component of a vector $a$ is
labeled $a_i$ and its squared Euclidean norm is $|a|^2 = a^T a$. By $e_i$ we denote the $i$-th unit vector
and $\mathbf{1}$ is $(1\ 1\ \dots\ 1)^T \in \mathbb{R}^m$. For any bounded set $\mathcal{W} \subset \mathbb{R}^m$ let $|\mathcal{W}| := \sup_{a, b \in \mathcal{W}} |a - b|$.
Further, let $\mathcal{S} := \{a \in \mathbb{R}^m \mid a \ge 0 \text{ and } \mathbf{1}^T a = 1\}$ (unit $m$-simplex). Let $\mathcal{X} := \{1, 2, \dots, N\}$
be an alphabet of cardinality $1 < N < \infty$ and let $x_a^b := x_a x_{a+1} \dots x_b$ be a sequence over
$\mathcal{X}$ where $x^n$ abbreviates $x_1^n$. The set of all probability distributions over $\mathcal{X}$ with non-zero
probabilities on all letters is $\mathcal{P}_+$ and with probability at least $\varepsilon > 0$ on all letters is $\mathcal{P}_\varepsilon$. For
$p_1, p_2, \dots, p_m \in \mathcal{P} \subseteq \mathcal{P}_+$ let $p(x) = (p_i(x))_{1 \le i \le m}$ be the vector of probabilities of $x$; the
matrix $P := (p(1)\ \dots\ p(N))$ is called a probability matrix over $\mathcal{P}$. Furthermore we set
$p_{\max}(x; P) := \max_{1 \le i \le m} p_i(x)$ and $p_{\max}(P) := \max_{x \in \mathcal{X}} p_{\max}(x; P)$; $p_{\min}(x; P)$ and $p_{\min}(P)$
are defined analogously. We omit the dependence on $P$ whenever it is clear from the context.
The natural logarithm is "ln", whereas "log" is the base-two logarithm. For a vector $a$ with
positive entries we define $\log a := (\log a_i)_{1 \le i \le m}$. For $x \in \mathcal{X}$ and $p \in \mathcal{P}_+$ we denote the (ideal)
code length of $x$ w.r.t. $p$ as $\ell(x, p) := -\log p(x)$. The expression $\nabla_w f := (\partial f / \partial w_i)_{1 \le i \le m}$
denotes the gradient of a function $f$; when unambiguous we write $\nabla f$ in place of $\nabla_w f$.

The Setting. Recall the process of statistical data compression for a sequence $x^n$ over $\mathcal{X}$
(see Section 1), which we now formally refine to our setting of interest. Fix an arbitrary step
$1 \le k \le n$. First, we represent the $m > 1$ model distributions $p_1, \dots, p_m \in \mathcal{P}_+$ (which may
depend on $x^{k-1}$ and typically vary from step to step) in a probability matrix $P_k$. One can
think of $x^n$ and the sequence $P^n := P_1, \dots, P_n$ of probability matrices over $\mathcal{P}_+$ as fixed. On
the basis of $P_k$ we determine a mixture distribution (for short mixture) $\mathrm{MIX}(w, P_k)$ for coding
the $k$-th character $x_k$ in $\ell(x_k, \mathrm{MIX}(w, P_k))$ bits. The mixture depends on a parameter vector
or weight vector $w = w_k$ which is typically constrained to a domain $\mathcal{W}$ (a non-empty, compact,
convex subset of $\mathbb{R}^m$). Based on an initial weight vector $w_1$ (chosen by the user) we generate
a sequence of weight vectors $w_2, w_3, \dots$ via OGD: In step $k$ we adjust $w_k$ by a step towards
$d := -\alpha \nabla_w \ell(x_k, \mathrm{MIX}(w, P_k))$ where $\alpha > 0$ is the step size. The resulting vector $v = w_k + d$
might not lie in $\mathcal{W}$; the operation $\mathrm{proj}(v; \mathcal{W}) := \arg\min_{w \in \mathcal{W}} |v - w|^2$ maps a vector $v \in \mathbb{R}^m$
back to the feasible set $\mathcal{W}$ and we obtain $w_{k+1} = \mathrm{proj}(v; \mathcal{W})$. Algorithm 1 summarizes this
process. Next we define the general term mixture as well as linear and geometric mixtures.



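Algorithm 1: $\mathrm{MIX\text{-}OGD}(w_1, \alpha, x^n, P^n)$

Input: a weight estimation $w_1 \in \mathcal{W}$, a step size $\alpha > 0$, a sequence $x^n$ over $\mathcal{X}$
       and a sequence $P^n$ of probability matrices over $\mathcal{P}_+$
Output: a codeword for $x^n$ of length $\ell(x^n, \mathrm{MIX\text{-}OGD}(w_1, \alpha, x^n, P^n))$

for $k \leftarrow 1$ to $n$ do
    compute $p \leftarrow \mathrm{MIX}(w_k, P_k)$ and emit a codeword for $x_k$ sized $\ell(x_k, p)$ bits;
    $w_{k+1} \leftarrow \mathrm{proj}\big(w_k - \alpha \nabla_w \ell(x_k, \mathrm{MIX}(w, P_k))\big|_{w = w_k};\ \mathcal{W}\big)$;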
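Algorithm 1 can be sketched in a few lines of Python. The following is a minimal illustration, not the author's implementation: it assumes the linear mixture of Definition 2.2 with $\mathcal{W} = \mathcal{S}$, uses a common sort-based routine for the Euclidean projection onto the simplex, and accumulates ideal code lengths instead of driving an arithmetic coder. The names `project_to_simplex`, `lin_mix` and `mix_ogd_lin` as well as the 0-based letter indices are hypothetical choices made for this sketch.

```python
import math

def project_to_simplex(v):
    """Euclidean projection of v onto the unit simplex (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:          # holds exactly for the active prefix
            theta = t
    return [max(v_i - theta, 0.0) for v_i in v]

def lin_mix(w, probs):
    """LIN(x; w, P) = w^T p(x); probs[i] is model i's probability of x."""
    return sum(w_i * p_i for w_i, p_i in zip(w, probs))

def mix_ogd_lin(w1, alpha, xs, Ps):
    """Return (ideal code length in bits, final weights) of LIN with OGD.

    xs[k] is the letter coded in step k (0-based index) and
    Ps[k][i][x] is the probability model i assigns to letter x in step k.
    """
    w, total_bits = list(w1), 0.0
    log2e = math.log2(math.e)
    for x, P in zip(xs, Ps):
        probs = [p_i[x] for p_i in P]
        p = lin_mix(w, probs)
        total_bits += -math.log2(p)                 # ideal code length of x
        grad = [-log2e * p_i / p for p_i in probs]  # gradient of -log2(w^T p(x))
        w = project_to_simplex([w_i - alpha * g for w_i, g in zip(w, grad)])
    return total_bits, w

# Usage: two fixed binary models; the first one predicts the data well.
Ps = [[[0.9, 0.1], [0.1, 0.9]]] * 50
bits, w = mix_ogd_lin([0.5, 0.5], 0.05, [0] * 50, Ps)
```

In this toy run the weight of the well-predicting model grows step by step until the projection clamps it at the simplex vertex, which mirrors the role of $\mathrm{proj}(\cdot; \mathcal{W})$ in line 3 of the algorithm.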



Definition 2.1. A mixture $\mathrm{MIX} : (w, P) \mapsto p$ maps a probability matrix $P$ over $\mathcal{P}_+$, given
a parameter vector $w$ drawn from the parameter space $\mathcal{W}$, to a mixture distribution $p \in \mathcal{P}_+$.
The shorthand $\mathrm{MIX}(x; w, P)$ is for $p(x)$ where $p = \mathrm{MIX}(w, P)$.

Definition 2.2. For weight (parameter) vector $w \in \mathcal{S}$ and probability matrix $P$ over $\mathcal{P}_+$
the linear mixture LIN is defined by $\mathrm{LIN}(x; w, P) := w^T p(x)$.

Definition 2.3. For weight (parameter) vector $w \in \mathbb{R}^m$ and probability matrix $P$ over $\mathcal{P}_+$
the geometric mixture GEO is defined by
$$\mathrm{GEO}(x; w, P) := \frac{\prod_{i=1}^{m} p_i(x)^{w_i}}{\sum_{y \in \mathcal{X}} \prod_{i=1}^{m} p_i(y)^{w_i}}.$$

Observation 2.4. If $l(x) := -\log p(x)$, then
$$\mathrm{GEO}(x; w, P) = \frac{2^{-w^T l(x)}}{\sum_{y \in \mathcal{X}} 2^{-w^T l(y)}}.$$
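The two mixtures and the equivalence stated in Observation 2.4 are easy to check numerically. The sketch below is illustrative only; the function names, the 0-based letter indices and the example matrix `P` are assumptions made here and do not appear in the paper.

```python
import math

def lin(x, w, P):
    """LIN(x; w, P) = w^T p(x) (Definition 2.2); P[i][x] = p_i(x)."""
    return sum(w_i * p_i[x] for w_i, p_i in zip(w, P))

def geo(x, w, P):
    """GEO(x; w, P) as a normalized product of powers (Definition 2.3)."""
    def unnormalized(y):
        return math.prod(p_i[y] ** w_i for w_i, p_i in zip(w, P))
    return unnormalized(x) / sum(unnormalized(y) for y in range(len(P[0])))

def geo_code_length_form(x, w, P):
    """GEO(x; w, P) written with l(y) = -log2 p(y) (Observation 2.4)."""
    def score(y):
        return 2.0 ** -sum(w_i * -math.log2(p_i[y]) for w_i, p_i in zip(w, P))
    return score(x) / sum(score(y) for y in range(len(P[0])))

# Two models over a three-letter alphabet and an arbitrary weight vector.
P = [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]
w = [0.25, 0.75]
difference = max(abs(geo(x, w, P) - geo_code_length_form(x, w, P))
                 for x in range(3))
```

Up to floating-point rounding, `difference` is zero. Setting `w` to a unit vector $e_i$ makes both mixtures return the single model distribution $p_i$.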

In the following we will draw heavily on the alternate expression for $\mathrm{GEO}(x; w, P)$ given in
Observation 2.4. This expression simplifies some of the upcoming calculations. Furthermore, let
$$\ell(x^n, \mathrm{MIX\text{-}OGD}(w_1, \alpha, x^n, P^n)) := \sum_{k=1}^{n} \ell(x_k, \mathrm{MIX}(w_k, P_k)) \quad \text{(for } w_k \text{ see Algorithm 1)},$$
$$\ell(x^n, P^n, w, \mathrm{MIX}) := \sum_{k=1}^{n} \ell(x_k, \mathrm{MIX}(w, P_k)) \quad \text{and} \quad \ell^*(x^n, P^n, \mathrm{MIX}) := \min_{w \in \mathcal{W}} \ell(x^n, P^n, w, \mathrm{MIX}).$$

3 Nice Mixtures and Code Length Bounds 

Nice mixtures. We now introduce a class of especially interesting mixtures. We call such 
mixtures nice. A nice mixture satisfies a couple of properties that allow us to derive bounds 
on the code length of combining such a mixture with OGD for parameter estimation (e.g. 
weight estimation). These properties have been chosen carefully, s.t. linear and geometric 
mixtures fall into the class of nice mixtures (see Section 4).

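Definition 3.1. A mixture MIX is called nice if

1. the parameter space $\mathcal{W}$ is a non-empty, compact and convex subset of $\mathbb{R}^m$,

2. $\ell(x, \mathrm{MIX}(w, P))$ is convex in $w \in \mathcal{W}$ for all $P$ over $\mathcal{P}_+$ and all $x \in \mathcal{X}$,

3. $\ell(x, \mathrm{MIX}(w, P))$ is differentiable w.r.t. $w$ for all $P$ over $\mathcal{P}_+$ and all $x \in \mathcal{X}$ and

4. there exists a constant $a > 0$ s.t. $|\nabla_w \ell(x, \mathrm{MIX}(w, P))|^2 \le a \cdot \ell(x, \mathrm{MIX}(w, P))$ for all
$w \in \mathcal{W}$, $P$ over $\mathcal{P}_+$ and $x \in \mathcal{X}$.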

Remark 3.2. Properties 1 to 3 are similar to the assumptions made in [12]; Property 4 differs.
This allows us to obtain meaningful bounds on $\ell(x^n, \mathrm{MIX\text{-}OGD}(w_1, \alpha, x^n, P^n))$ when $\alpha$ is
independent of n, as [U 0] show. 

Bounds on the Code Length for OGD. Algorithm 1 illustrates an online algorithm for
mixture-based statistical data compression which employs a mixture MIX. We want to analyze
the algorithm in terms of the number of bits required to encode a sequence when MIX is nice.
We strive to show that in some sense the code length produced by Algorithm 1 is not much
worse than a desirable competing scheme. At first we choose the code length produced by
the best static weight vector $w^* = \arg\min_{w \in \mathcal{W}} \ell(x^n, P^n, w, \mathrm{MIX})$ as the competing scheme.



Proposition 3.3. Algorithm 1 run with a nice mixture MIX, initial weight vector $w_1 \in \mathcal{W}$
and step size $\alpha = 2(1 - b^{-1})/a$ for $b > 1$ (the constant $a$ is due to Definition 3.1, Property 4)
satisfies
$$\ell(x^n, \mathrm{MIX\text{-}OGD}(w_1, \alpha, x^n, P^n)) \le b \cdot \ell^*(x^n, P^n, \mathrm{MIX}) + \frac{a}{4} \cdot \frac{b^2}{b - 1} \, |w_1 - w^*|^2, \tag{1}$$
where $w^*$ minimizes $\ell(x^n, P^n, w, \mathrm{MIX})$, for all $x^n$ over $\mathcal{X}$ and all $P^n$ over $\mathcal{P}_+$.

Proof. For brevity we set $\ell_k(w) := \ell(x_k, \mathrm{MIX}(w, P_k))$. As in [3], for arbitrary $w \in \mathcal{W}$, we
first establish a lower bound on
$$|w_k - w|^2 - |w_{k+1} - w|^2 = |w_k - w|^2 - |\mathrm{proj}(w_k - \alpha \nabla \ell_k(w_k); \mathcal{W}) - w|^2.$$
For $v \in \mathbb{R}^m$ and $w \in \mathcal{W}$ it is well-known [12] that $|\mathrm{proj}(v; \mathcal{W}) - w| \le |v - w|$, i.e.
$$|w_k - w|^2 - |w_{k+1} - w|^2 \ge |w_k - w|^2 - |(w_k - w) - \alpha \nabla \ell_k(w_k)|^2 = 2\alpha \nabla \ell_k(w_k)^T (w_k - w) - \alpha^2 |\nabla \ell_k(w_k)|^2.$$
Since MIX is nice, $\ell_k(w)$ is convex (due to Definition 3.1, Property 2) and we have
$\ell_k(v) - \ell_k(w) \le \nabla \ell_k(v)^T (v - w)$ for any $w, v \in \mathcal{W}$. We deduce
$$|w_k - w|^2 - |w_{k+1} - w|^2 \ge 2\alpha(\ell_k(w_k) - \ell_k(w)) - \alpha^2 |\nabla \ell_k(w_k)|^2 \ge 2\alpha(\ell_k(w_k) - \ell_k(w)) - a\alpha^2 \ell_k(w_k), \tag{2}$$
where the last inequality follows from Definition 3.1, Property 4. Next, we sum the previous
inequality over $k$ to obtain (the sum telescopes)
$$\alpha(2 - a\alpha) \sum_{k=1}^{n} \ell_k(w_k) - 2\alpha \sum_{k=1}^{n} \ell_k(w) \le \sum_{k=1}^{n} \left( |w_k - w|^2 - |w_{k+1} - w|^2 \right) \le |w_1 - w|^2,$$
which we solve for the first sum:
$$\sum_{k=1}^{n} \ell_k(w_k) \le \frac{2}{2 - a\alpha} \sum_{k=1}^{n} \ell_k(w) + \frac{|w_1 - w|^2}{\alpha(2 - a\alpha)}.$$
Since this holds for any $w$, it must hold for $w = w^*$, too. By the definition of $\ell_k(w)$ we
have $\sum_{k=1}^{n} \ell_k(w_k) = \ell(x^n, \mathrm{MIX\text{-}OGD}(w_1, \alpha, x^n, P^n))$ and $\sum_{k=1}^{n} \ell_k(w) = \ell(x^n, P^n, w, \mathrm{MIX})$.
Our choice of $\alpha$ gives (1). □
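The last step of the proof is a substitution that can be verified mechanically: with $\alpha = 2(1 - b^{-1})/a$ the factor $2/(2 - a\alpha)$ becomes $b$ and $1/(\alpha(2 - a\alpha))$ becomes $ab^2/(4(b-1))$. The following sketch checks this numerically; the helper name `prop33_constants` and the sample values of $a$ and $b$ are assumptions made for illustration.

```python
def prop33_constants(a, b):
    """Step size and the two constants obtained in the proof of Proposition 3.3."""
    alpha = 2.0 * (1.0 - 1.0 / b) / a
    multiplicative = 2.0 / (2.0 - a * alpha)        # factor of the competitor's code length
    additive = 1.0 / (alpha * (2.0 - a * alpha))    # factor of |w_1 - w*|^2
    return alpha, multiplicative, additive

a, b = 3.0, 2.0
alpha, multiplicative, additive = prop33_constants(a, b)
closed_form = a * b * b / (4.0 * (b - 1.0))
```

For these sample values `multiplicative` equals $b = 2$ and `additive` agrees with `closed_form`, which is 3.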

Remark 3.4. The technique of using a progress invariant (cf. (2)) in the previous proof is
adopted from the machine learning community, see [TJ 0] . These two papers assume that 
the domain of the parameter (weight) vector $w$ is unbounded. Techniques of [12] allow us
to overcome this limitation. Proposition 3.3 generalizes the analysis of online regression of
PP to prediction functions f(w, z) (z is the input vector for a prediction) instead of f(w T z) 
when the domain of w is restricted. 

The previous proposition is good news. The number of bits required to code any sequence
will be within a multiplicative constant $b$ of the code length generated by weighting with an
optimal fixed weight vector, $\ell^*(x^n, P^n, \mathrm{MIX})$, plus an $O(1)$ term. At the expense of increasing
the $O(1)$ term we can set the multiplicative constant $b$ arbitrarily close to 1. Note that the
$O(1)$ term originates in the inaccuracy of the initial weight estimation $|w_1 - w^*|$ (see (1))
and as $b$ approaches 1, the step size $\alpha$ approaches zero. Hence the $O(1)$ term in (1) penalizes
a slow movement away from $w_1$. A high proximity of $w_1$ to the optimal weight vector $w^*$
damps this penalization. We now make two key observations, which allow us to greatly
strengthen the result of Proposition 3.3.



Observation 3.5. From the previous discussion we know that the significance of the $O(1)$
term vanishes as $\ell^*(x^n, P^n, \mathrm{MIX})$ grows. We can allow small values of $b$ for large values of
$n$, i.e., $b$ may depend on $n$. Thus we choose $b = 1 + f(n)$ where $f(n)$ decreases, and obtain
$$\ell(x^n, \mathrm{MIX\text{-}OGD}(w_1, \alpha, x^n, P^n)) \le \ell^*(x^n, P^n, \mathrm{MIX}) + \ell^*(x^n, P^n, \mathrm{MIX}) \cdot f(n) + \frac{a(1 + f(1))^2 |w_1 - w^*|^2}{4} \cdot \frac{1}{f(n)}.$$
If $\ell^*(x^n, P^n, \mathrm{MIX})$ is $O(n)$ (i.e., $\mathrm{MIX}(x; w, P)$ is bounded below by a constant, which is a
natural assumption) then the rightmost two terms on the previous line are $O(n \cdot f(n) + f(n)^{-1})$
(since by Definition 3.1, Property 1, $|w_1 - w^*|$ is $O(1)$) and represent the number of bits wasted
by MIX-OGD w.r.t. $\ell^*(x^n, P^n, \mathrm{MIX})$. Clearly the rate of growth is minimized in the $O$-sense
if we choose $f(n) = n^{-1/2}$, i.e. $\ell(x^n, \mathrm{MIX\text{-}OGD}(w_1, \alpha, x^n, P^n)) \le \ell^*(x^n, P^n, \mathrm{MIX}) + O(n^{1/2})$.
The average code length excess of MIX-OGD over $\ell^*(x^n, P^n, \mathrm{MIX})$ vanishes asymptotically.
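The trade-off behind the choice $f(n) = n^{-1/2}$ can be made concrete with a small calculation. The sketch below evaluates the wasted-bits expression $c \cdot n \cdot f(n) + K / f(n)$ for $f(n) = n^{-\gamma}$; the constants `c` and `K` stand in for the $O(\cdot)$ constants of Observation 3.5 and are set to 1 purely for illustration.

```python
def wasted_bits(n, gamma, c=1.0, K=1.0):
    """Evaluate c*n*f(n) + K/f(n) for f(n) = n**(-gamma)."""
    f = n ** -gamma
    return c * n * f + K / f

n = 10 ** 6
waste = {gamma: wasted_bits(n, gamma) for gamma in (0.25, 0.5, 0.75)}
```

With $n = 10^6$ the exponent $\gamma = 1/2$ gives about $2\sqrt{n} = 2000$, while both $\gamma = 1/4$ and $\gamma = 3/4$ give more than $3 \cdot 10^4$, since one of the two terms then grows like $n^{3/4}$.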

Observation 3.6. The state of MIX-OGD right after step $k$ is captured completely by the
single weight vector $w_{k+1}$. Hence we can view running $\mathrm{MIX\text{-}OGD}(w_1, \alpha, x^n, P^n)$ as first
executing $\mathrm{MIX\text{-}OGD}(w_1, \alpha, x^k, P^k)$ and running $\mathrm{MIX\text{-}OGD}(w_{k+1}, \alpha, x_{k+1}^n, P_{k+1}^n)$ afterwards.
The code lengths for these procedures match for all $1 \le k < n$:
$$\ell(x^n, \mathrm{MIX\text{-}OGD}(w_1, \alpha, x^n, P^n)) = \ell(x^k, \mathrm{MIX\text{-}OGD}(w_1, \alpha, x^k, P^k)) + \ell(x_{k+1}^n, \mathrm{MIX\text{-}OGD}(w_{k+1}, \alpha, x_{k+1}^n, P_{k+1}^n)).$$

Given the previous observations as tools of the trade, we now enhance Proposition 3.3.



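Theorem 3.7. We consider sequences $t_1 = 1 < t_2 < \dots < t_s < t_{s+1} = n + 1$ of integers
for $1 \le s \le n$. Let $\ell^*(i, j, \mathrm{MIX}) := \ell^*(x_i^j, P_i^j, \mathrm{MIX})$. For all $x^n \in \mathcal{X}^n$, all $P^n$ over $\mathcal{P}_+$, any
nice mixture MIX and any $w_1 \in \mathcal{W}$ Algorithm 1 satisfies:

1. If $\alpha = 2(1 - b^{-1})/a$, where $b > 1$, then
$$\ell(x^n, \mathrm{MIX\text{-}OGD}(w_1, \alpha, x^n, P^n)) \le \min_{s, t_2, \dots, t_s} \left[ \frac{a b^2 |\mathcal{W}|^2}{4(b - 1)} \cdot s + b \sum_{i=1}^{s} \ell^*(t_i, t_{i+1} - 1, \mathrm{MIX}) \right]. \tag{3}$$

2. If $\alpha = 2/a \cdot (1 + n^{1/2})^{-1}$ (i.e., $b = 1 + n^{-1/2}$) and $\ell^*(x^n, P^n, \mathrm{MIX}) \le c \cdot n$ holds for a
constant $c > 0$, all $x^n$ over $\mathcal{X}$ and all $P^n$ over $\mathcal{P}_+$ then
$$\ell(x^n, \mathrm{MIX\text{-}OGD}(w_1, \alpha, x^n, P^n)) \le \min_{s, t_2, \dots, t_s} \left[ (a s |\mathcal{W}|^2 + c) \sqrt{n} + \sum_{i=1}^{s} \ell^*(t_i, t_{i+1} - 1, \mathrm{MIX}) \right]. \tag{4}$$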



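Proof. We start by proving (3). First, we define $\ell_k(w) := \ell(x_k, \mathrm{MIX}(w, P_k))$. By Observation
3.6, for any $1 \le s \le n$ and $t_1 = 1 < t_2 < \dots < t_s < t_{s+1} = n + 1$ we may write
$$\ell(x^n, \mathrm{MIX\text{-}OGD}(w_1, \alpha, x^n, P^n)) = \sum_{i=1}^{s} \ell\big(x_{t_i}^{t_{i+1}-1}, \mathrm{MIX\text{-}OGD}(w_{t_i}, \alpha, x_{t_i}^{t_{i+1}-1}, P_{t_i}^{t_{i+1}-1})\big) \le \sum_{i=1}^{s} \left[ \frac{a b^2 |\mathcal{W}|^2}{4(b - 1)} + b \, \ell^*(t_i, t_{i+1} - 1, \mathrm{MIX}) \right]. \tag{5}$$
For the last step we used Proposition 3.3, the definition of $\ell^*(t_i, t_{i+1} - 1, \mathrm{MIX})$ and Definition 3.1,
Property 1, which implies that $|v - w| \le |\mathcal{W}|$ for any $v, w \in \mathcal{W}$. Since this holds for arbitrary
$s$ and $t_2, \dots, t_s$ we can take the minimum over the corresponding entities, which gives (3).

Now we turn to (4). The choice $b = 1 + n^{-1/2}$ follows from Observation 3.5. We combine
$b^2/(b - 1) \le 4 n^{1/2}$ (by the choice of $b$) with $\ell^*(x^n, P^n, \mathrm{MIX}) \le c n$, i.e. $\ell^*(i, j, \mathrm{MIX}) \le c \cdot (j - i + 1)$
for $j \ge i$, in the r.h.s. of (5) to yield
$$\ell(x^n, \mathrm{MIX\text{-}OGD}(w_1, \alpha, x^n, P^n)) \le a |\mathcal{W}|^2 s \cdot n^{1/2} + (1 + n^{-1/2}) \sum_{i=1}^{s} \ell^*(t_i, t_{i+1} - 1, \mathrm{MIX}) \le (a |\mathcal{W}|^2 s + c) \cdot n^{1/2} + \sum_{i=1}^{s} \ell^*(t_i, t_{i+1} - 1, \mathrm{MIX}).$$
As in the proof of (3) we take the minimum over $s$ and $t_2, \dots, t_s$, which gives (4). □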



The previous theorem gives much stronger bounds than Proposition 3.3 since the competing
scheme is a sequence of weight vectors with a total code length of $\ell^*(t_1, t_2 - 1, \mathrm{MIX}) +
\dots + \ell^*(t_s, t_{s+1} - 1, \mathrm{MIX})$, where the $i$-th weight vector minimizes the code length of the
$i$-th subsequence $x_{t_i} \dots x_{t_{i+1}-1}$ of $x^n$. By (3) the performance of Algorithm 1 is within a
multiplicative constant $b > 1$ of the performance of any competing scheme (since in (3) we
take the minimum over all competing schemes) plus an $O(s)$ term, when $\alpha$ is independent
of $n$. The $O(s)$ term penalizes the complexity of a competing predictor (the number $s$ of
subsequences). When $\alpha$ depends on $n$ (cf. (4)) we can reduce the multiplicative constant to 1
at the expense of increasing the penalty term to $O(s\sqrt{n})$, i.e. Algorithm 1 will asymptotically
perform not much worse than any such competing scheme with $s = o(\sqrt{n})$ subsequences.

4 Bounds for Geometric and Linear Mixtures 

Geometric and Linear Mixtures are Nice. We can only apply the machinery of the 
previous section to geometric and linear mixtures if they fall into the class of nice mixtures. 
Since the necessary conditions have been chosen carefully, this is the case: 

Lemma 4.1. The geometric mixture $\mathrm{GEO}(w, P)$ is nice for $w \in \mathcal{W}$, if $\mathcal{W}$ is a compact and
convex subset of $\mathbb{R}^m$. Property 4 of Definition 3.1 is satisfied for $a \ge m \log^2(p_{\max}/p_{\min}) / \log(e)$.

Lemma 4.2. The linear mixture $\mathrm{LIN}(w, P)$ is nice. Property 4 of Definition 3.1 is satisfied
for $a \ge m \log^2(e) \cdot \dfrac{p_{\max}^2}{-p_{\min}^2 \log p_{\min}}$.

Before we prove these two lemmas we give two technical results. The proofs of the lemmas
below use standard calculus; we omit them for reasons of space.

Lemma 4.3. For $0 < z < 1$ the function $f(z) := \dfrac{-\ln z}{1 - z}$ satisfies $f(z) \ge 1$.

Lemma 4.4. For $0 < a \le z \le 1 - a$ the function $f(z) := -z^2 \ln z$ satisfies $f(z) \ge f(a)$.

Now we are ready to prove Lemma 4.1 and Lemma 4.2.

Proof of Lemma 4.1. Let $p(x; w) := \mathrm{GEO}(x; w, P)$ and $\ell(w) := \ell(x, \mathrm{GEO}(w, P))$. To show
the claim we must make sure that Properties 1 to 4 of Definition 3.1 are met. By the constraint
on $\mathcal{W}$ Property 1 is satisfied. Property 2 was shown in [6, Section 3.2]. To see that Property
3 holds, we set $c := \sum_{y \in \mathcal{X}} 2^{-w^T l(y)}$ and compute
$$\nabla \ell(w) = \nabla_w \left( w^T l(x) + \log c \right) = l(x) - \sum_{y \in \mathcal{X}} \frac{2^{-w^T l(y)}}{c} \, l(y),$$
which is (by the definition of GEO)
$$\nabla \ell(w) = \nabla_w \ell(x, \mathrm{GEO}(w, P)) = \sum_{y \ne x} \mathrm{GEO}(y; w, P) \cdot (l(x) - l(y)). \tag{6}$$
Clearly (6) is well-defined for the given range of $w$ and $P$. For Property 4 we bound
$|\nabla \ell(w)|^2 / \ell(w)$ from above by a constant; $a$ takes at least the value of this constant. We obtain
$$|\nabla \ell(w)|^2 \le \sum_{y \ne x} p(y; w) \, |l(x) - l(y)|^2 = \sum_{y \ne x} p(y; w) \sum_{i=1}^{m} \log^2 \frac{p_i(y)}{p_i(x)} \le \sum_{y \ne x} p(y; w) \, m \log^2 \frac{p_{\max}}{p_{\min}} = (1 - p(x; w)) \, m \log^2 \frac{p_{\max}}{p_{\min}}$$
and
$$\frac{|\nabla \ell(w)|^2}{\ell(w)} \le \frac{(1 - p(x; w)) \, m \log^2 \frac{p_{\max}}{p_{\min}}}{-\log p(x; w)} \le \left[ \inf_{0 < z < 1} \frac{-\ln z}{1 - z} \right]^{-1} \cdot \frac{m \log^2 \frac{p_{\max}}{p_{\min}}}{\log(e)}.$$
By Lemma 4.3 the infimum is at least 1. This yields the claimed lower bound on $a$. □
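The gradient formula (6) and the resulting constant for Property 4 can be sanity-checked numerically. The sketch below compares (6) with a central finite-difference approximation and evaluates the ratio $|\nabla \ell(w)|^2 / \ell(w)$ against the bound of Lemma 4.1 for one arbitrary example. All function names, the example matrix `P` and the weight vector are illustrative assumptions; a single example does not prove the lemma.

```python
import math

def code_lengths(P, y):
    """l(y) = (-log2 p_i(y))_i for letter y (0-based index)."""
    return [-math.log2(p_i[y]) for p_i in P]

def geo(y, w, P):
    """GEO(y; w, P) in the form of Observation 2.4."""
    def score(z):
        return 2.0 ** -sum(a * b for a, b in zip(w, code_lengths(P, z)))
    return score(y) / sum(score(z) for z in range(len(P[0])))

def geo_gradient(x, w, P):
    """Gradient of l(x, GEO(w, P)) w.r.t. w according to Equation (6)."""
    lx = code_lengths(P, x)
    grad = [0.0] * len(w)
    for y in range(len(P[0])):
        if y == x:
            continue
        g, ly = geo(y, w, P), code_lengths(P, y)
        grad = [acc + g * (a - b) for acc, a, b in zip(grad, lx, ly)]
    return grad

def numeric_gradient(x, w, P, h=1e-6):
    """Central finite differences of w -> -log2 GEO(x; w, P)."""
    def loss(v):
        return -math.log2(geo(x, v, P))
    grad = []
    for i in range(len(w)):
        up = [w_j + (h if j == i else 0.0) for j, w_j in enumerate(w)]
        down = [w_j - (h if j == i else 0.0) for j, w_j in enumerate(w)]
        grad.append((loss(up) - loss(down)) / (2.0 * h))
    return grad

P = [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]
w, x = [0.4, 0.6], 0
g = geo_gradient(x, w, P)
ratio = sum(v * v for v in g) / -math.log2(geo(x, w, P))
bound = len(P) * math.log2(0.7 / 0.1) ** 2 / math.log2(math.e)
```

Here `0.7` and `0.1` are $p_{\max}$ and $p_{\min}$ of the example matrix. The analytic and numeric gradients agree to within the finite-difference error, and `ratio` stays well below `bound`.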

Remark 4.5. It is interesting to note that we can express $\nabla_w \ell(x, \mathrm{GEO}(w, P))$ (see (6)) in terms
of information theoretic quantities (for the basic notation see, e.g. [3]). The $i$-th component is
$$-\log p_i(x) - \sum_{y \in \mathcal{X}} \mathrm{GEO}(y; w, P)(-\log p_i(y)) = -\log p_i(x) - \sum_{y \in \mathcal{X}} \mathrm{GEO}(y; w, P) \left[ \log \frac{1}{\mathrm{GEO}(y; w, P)} + \log \frac{\mathrm{GEO}(y; w, P)}{p_i(y)} \right]$$
$$= -\log p_i(x) - \big( H(\mathrm{GEO}(w, P)) + D(\mathrm{GEO}(w, P) \,\|\, p_i) \big).$$
If we now ignore possible constraints on the weight vector $w$ then for some character $x$ a
minimizer of $\min_w \ell(x, \mathrm{GEO}(w, P))$ satisfies $H(\mathrm{GEO}(w, P)) + D(\mathrm{GEO}(w, P) \,\|\, p_i) = -\log p_i(x)$
for all $1 \le i \le m$. In effect the weight vector $w$ is chosen s.t. there is an equilibrium: The
code length $-\log p_i(x)$ matches the average code length of coding a symbol drawn from
the source distribution $\mathrm{GEO}(w, P)$ with the model distribution $p_i$.

Proof of Lemma 4.2. Again we set $p(x; w) := \mathrm{LIN}(x; w, P)$ and $\ell(w) := \ell(x, \mathrm{LIN}(w, P))$
and proceed analogously to the proof of Lemma 4.1. By Definition 2.2 we have $w \in \mathcal{S}$, so
Property 1 is met, and in [6, Section 4.2] we showed that Property 2 is met as well. The gradient
$$\nabla \ell(w) = \nabla_w \ell(x, \mathrm{LIN}(w, P)) = -\nabla_w \log w^T p(x) = -\log(e) \, \frac{p(x)}{p(x; w)}$$
is well-defined for the given range of $w$ and $P$, so Property 3 is fulfilled. We observe that
$$\frac{|\nabla \ell(w)|^2}{\ell(w)} \le \frac{m \log^2(e) \, p_{\max}^2}{p(x; w)^2 \, (-\log p(x; w))} \le m \log(e) \, p_{\max}^2 \left[ \inf_{c \le z \le d} -z^2 \ln z \right]^{-1}, \tag{7}$$
where $c = p_{\min} \le p(x; w) \le p_{\max} \le d = 1 - p_{\min}$. We used $p_{\max} \le 1 - p_{\min}$, since
$$p_{\max} = \max_{1 \le i \le m} \max_{x \in \mathcal{X}} p_i(x) \le \max_{1 \le i \le m} \left( 1 - \min_{x \in \mathcal{X}} p_i(x) \right) = 1 - \min_{1 \le i \le m} \min_{x \in \mathcal{X}} p_i(x) = 1 - p_{\min},$$
to apply Lemma 4.4 to bound the rightmost factor in (7) from above by $[-p_{\min}^2 \ln p_{\min}]^{-1}$.
The resulting constant on the r.h.s. of (7) is a lower bound on $a$. The proof is done. □
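As with the geometric mixture, the constant of Lemma 4.2 can be illustrated on a small example. The following sketch computes the gradient of $\ell(x, \mathrm{LIN}(w, P))$ from the closed form in the proof and compares the ratio $|\nabla \ell(w)|^2 / \ell(w)$ with $m \log^2(e) \, p_{\max}^2 / (-p_{\min}^2 \log p_{\min})$. The helper names and the example values are assumptions chosen for illustration only.

```python
import math

def lin_gradient(x, w, P):
    """Gradient of -log2(w^T p(x)) w.r.t. w: -log2(e) * p(x) / p(x; w)."""
    p = sum(w_i * p_i[x] for w_i, p_i in zip(w, P))
    return [-math.log2(math.e) * p_i[x] / p for p_i in P]

def lin_ratio_and_bound(x, w, P):
    """Return (|grad|^2 / code length, constant of Lemma 4.2)."""
    p = sum(w_i * p_i[x] for w_i, p_i in zip(w, P))
    grad = lin_gradient(x, w, P)
    ratio = sum(g * g for g in grad) / -math.log2(p)
    p_max = max(max(row) for row in P)
    p_min = min(min(row) for row in P)
    bound = (len(P) * math.log2(math.e) ** 2 * p_max ** 2
             / (-(p_min ** 2) * math.log2(p_min)))
    return ratio, bound

P = [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]
ratio, bound = lin_ratio_and_bound(0, [0.4, 0.6], P)
```

For this example the ratio is roughly 5.1 while the constant is above 60. The gap reflects the worst-case nature of the bound, which must also cover mixture probabilities close to $p_{\min}$.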



Upper bounds on the Code Length. At this point we can combine Theorem 3.7 with
Lemmas 4.1 and 4.2 to obtain code length bounds on Algorithm 1 for LIN and GEO. The
discussion in Section 3 on nice mixtures coupled with Algorithm 1 applies to LIN and GEO as well.

Theorem 4.6. Let $x^n \in \mathcal{X}^n$, let $P^n$ be a sequence of probability matrices over $\mathcal{P}_\varepsilon$ where
$\varepsilon = 2^{-B}$ for $1 \le B < \infty$ and let $\ell^*(k, l, \mathrm{BEST}) := \min_{1 \le i \le m} \ell(x_k^l, \mathrm{LIN}(e_i, P_k^l))$ be the code
length of the best single model for $x_k^l$. We consider sequences $t_1 = 1 < t_2 < \dots < t_s <
t_{s+1} = n + 1$ of integers where $1 \le s \le n$. For MIX = LIN and MIX = GEO where $\mathcal{W} = \mathcal{S}$
Algorithm 1 satisfies the bounds in Table 1 for the given step sizes for all $w_1 \in \mathcal{S}$.

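Proof. For the sake of simplicity we set $\mathrm{LIN\text{-}OGD}(\alpha) := \mathrm{LIN\text{-}OGD}(w_1, \alpha, x^n, P^n)$. We start
by proving row 1 in Table 1. By Lemma 4.2 we can use Theorem 3.7, Equation (3) with
MIX = LIN, $b = 2$ and $\mathcal{W} = \mathcal{S}$ where $|\mathcal{S}|^2 \le 2$, which gives
$$\alpha = \frac{1}{a} \quad \text{and} \quad \ell(x^n, \mathrm{LIN\text{-}OGD}(\alpha)) \le 2as + 2 \sum_{i=1}^{s} \ell^*(t_i, t_{i+1} - 1, \mathrm{LIN}) \tag{8}$$
for any $s, t_2, \dots, t_s$. Observe that
$$\ell^*(k, l, \mathrm{LIN}) = \min_{w \in \mathcal{S}} \ell(x_k^l, \mathrm{LIN}(w, P_k^l)) \le \min_{1 \le i \le m} \ell(x_k^l, \mathrm{LIN}(e_i, P_k^l)) = \ell^*(k, l, \mathrm{BEST}) \tag{9}$$
and by Lemma 4.2 we can choose
$$a = \frac{17 m 4^B}{8B} \cdot f(n) \ge \frac{17 m}{8} \cdot \frac{1}{-\varepsilon^2 \log \varepsilon} \ge m \log^2(e) \, \frac{p_{\max}^2}{-p_{\min}^2 \log p_{\min}} \tag{10}$$
for some $f(n) \ge 1$. We set $f(n) = 1$ and combine (9) and (10) with (8) to yield
$$\alpha = \frac{8B}{17 m 4^B} \quad \text{and} \quad \ell(x^n, \mathrm{LIN\text{-}OGD}(\alpha)) \le \frac{17 m s 4^B}{4B} + 2 \sum_{i=1}^{s} \ell^*(t_i, t_{i+1} - 1, \mathrm{BEST}).$$
Finally we can take the minimum over $s, t_2, \dots, t_s$, since these were arbitrary, which gives
the claim. Now we advance to Table 1, row 2. Again, by Lemma 4.2 we use Theorem 3.7,
Equation (4) with MIX = LIN, $c = -\log \varepsilon = B$ and $\mathcal{W} = \mathcal{S}$, which gives
$$\alpha = \frac{2/a}{1 + \sqrt{n}} \quad \text{and} \quad \ell(x^n, \mathrm{LIN\text{-}OGD}(\alpha)) \le (2as + B)\sqrt{n} + \sum_{i=1}^{s} \ell^*(t_i, t_{i+1} - 1, \mathrm{LIN}) \tag{11}$$
for any $s, t_2, \dots, t_s$. We now choose $a$ as in (10) with $1 \le f(n) = \frac{2\sqrt{n}}{1 + \sqrt{n}}$, so we get
$$(2as + B) = \frac{17 m s 4^B}{4B} f(n) + B \le (17 m s + 1) \frac{4^B}{2B} \le \frac{35 m s 4^B}{4B} \tag{12}$$
for the constant on the r.h.s. of (11). We combine (9) and (12) with (11) to yield
$$\alpha = \frac{8B/\sqrt{n}}{17 m 4^B} \quad \text{and} \quad \ell(x^n, \mathrm{LIN\text{-}OGD}(\alpha)) \le \frac{35 m s 4^B}{4B} \sqrt{n} + \sum_{i=1}^{s} \ell^*(t_i, t_{i+1} - 1, \mathrm{BEST}).$$
Again, taking the minimum over $s, t_2, \dots, t_s$ finishes the proof. The bounds of Table 1, rows
3 and 4, follow analogously by the choice of $f(n) = 1$ (row 3) and $f(n) = \frac{2\sqrt{n}}{1 + \sqrt{n}}$ (row 4) and
$$a = \frac{7 m B^2}{10} \cdot f(n) \ge \frac{7m}{10} \log^2 \frac{1}{\varepsilon} \ge \frac{m \log^2(p_{\max}/p_{\min})}{\log e}$$
and by
$$\ell^*(k, l, \mathrm{GEO}) = \min_{w \in \mathcal{S}} \ell(x_k^l, \mathrm{GEO}(w, P_k^l)) \le \min_{1 \le i \le m} \ell(x_k^l, \mathrm{GEO}(e_i, P_k^l)) = \ell^*(k, l, \mathrm{BEST})$$
and using $\ell^*(x^n, P^n, \mathrm{GEO}) \le B \cdot n$ (a premise of Theorem 3.7, Item 2), since for all $w \in \mathcal{S}$
$$\mathrm{GEO}(x; w, P) \ge \prod_{i=1}^{m} p_i(x)^{w_i} \ge p_{\min}(x; P) \ge \varepsilon = 2^{-B} \tag{13}$$
and consequently $\ell(x, \mathrm{GEO}(w, P)) \le B$. □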



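Table 1: Code length bounds of Algorithm 1 for MIX = LIN and MIX = GEO where $\mathcal{W} = \mathcal{S}$.
Each row bounds $\ell(x^n, \mathrm{MIX\text{-}OGD}(w_1, \alpha, x^n, P^n))$ from above by the minimum over $s, t_2, \dots, t_s$ of the stated expression.

Row   MIX   Step size $\alpha$                   Upper bound (minimum over $s, t_2, \dots, t_s$ of)

1     LIN   $\frac{8B}{17 m 4^B}$                $2 \sum_{i=1}^{s} \ell^*(t_i, t_{i+1} - 1, \mathrm{BEST}) + \frac{17 m s 4^B}{4B}$

2     LIN   $\frac{8B/\sqrt{n}}{17 m 4^B}$       $\sum_{i=1}^{s} \ell^*(t_i, t_{i+1} - 1, \mathrm{BEST}) + \frac{35 m s 4^B}{4B} \sqrt{n}$

3     GEO   $\frac{10}{7 m B^2}$                 $2 \sum_{i=1}^{s} \ell^*(t_i, t_{i+1} - 1, \mathrm{BEST}) + \frac{7 m s B^2}{5}$

4     GEO   $\frac{10/\sqrt{n}}{7 m B^2}$        $\sum_{i=1}^{s} \ell^*(t_i, t_{i+1} - 1, \mathrm{BEST}) + \frac{19 m s B^2}{5} \sqrt{n}$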

Remark 4.7. In the previous proof (13) shows that $\mathrm{GEO}(x; w, P) \ge p_{\min}(x; P)$ when $w \in \mathcal{S}$,
just as LIN. Consequently GEO cannot use more bits than LIN to encode a single symbol
in the worst case. In the best case $\mathrm{GEO}(x; w, P)$ uses at most as many bits as LIN, since
$$\max_{w \in \mathcal{S}} \mathrm{GEO}(x; w, P) \ge \max_{1 \le i \le m} \mathrm{GEO}(x; e_i, P) = p_{\max}(x; P) = \max_{w \in \mathcal{S}} \mathrm{LIN}(x; w, P).$$
There exist situations where $\max_{w \in \mathcal{S}} \mathrm{GEO}(x; w, P) > \max_{w \in \mathcal{S}} \mathrm{LIN}(x; w, P)$, see Example 4.8.

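Example 4.8. For an alphabet $\mathcal{X} = \{1, 2, \dots, N\}$, $N > 2$, we consider $\mathrm{GEO}(w, P)$ where
$w = (1/2\ \ 1/2)^T$ and $P^T = (p_1(x)\ \ p_2(x))_{x \in \mathcal{X}}$ s.t. for $0 < \varepsilon, q < 1$ we have
$$p_1(x) := \begin{cases} q, & x = 1 \\ (1 - q) \cdot (1 - \varepsilon), & x = 2 \\ \frac{1 - q}{N - 2} \cdot \varepsilon, & \text{otherwise} \end{cases} \qquad p_2(x) := \begin{cases} q, & x = 1 \\ (1 - q) \cdot (1 - \varepsilon), & x = 3 \\ \frac{1 - q}{N - 2} \cdot \varepsilon, & \text{otherwise.} \end{cases}$$
The mixture probability $\mathrm{GEO}(1; w, P)$ of the letter 1 is
$$\frac{p_1(1)^{1/2} \cdot p_2(1)^{1/2}}{\sum_{y \in \mathcal{X}} p_1(y)^{1/2} \cdot p_2(y)^{1/2}} = \frac{q}{q + (1 - q) \cdot f(\varepsilon, N)} \quad \text{where} \quad f(\varepsilon, N) := 2 \sqrt{\frac{\varepsilon (1 - \varepsilon)}{N - 2}} + \frac{N - 3}{N - 2} \, \varepsilon.$$
We now show that for any $q$ there exists an $\varepsilon$ such that $\mathrm{GEO}(1; w, P) > p_{\max}(1; P) = q$.
Clearly, if $\mathrm{GEO}(1; w, P) > q$ we must have $f(\varepsilon, N) < 1$. To observe this we bound $f(\varepsilon, N)$
from above and give a possible choice for $\varepsilon$:
$$f(\varepsilon, N) < 2 \sqrt{\frac{\varepsilon}{N - 2}} + (N - 3) \sqrt{\frac{\varepsilon}{N - 2}} = \frac{N - 1}{\sqrt{N - 2}} \sqrt{\varepsilon}.$$
If we choose $0 < \varepsilon < (N - 2)/(N - 1)^2$ it follows that $f(\varepsilon, N) < 1$ and $\mathrm{GEO}(1; w, P) > q$.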
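Example 4.8 can be reproduced numerically. The sketch below builds the two model distributions for given $N$, $q$ and $\varepsilon$, evaluates $\mathrm{GEO}(1; w, P)$ for $w = (1/2\ 1/2)^T$ directly from Definition 2.3, and compares it with the closed form in terms of $f(\varepsilon, N)$. The function name `example_48` and the chosen parameter values are assumptions for illustration; list index 0 corresponds to the letter 1.

```python
import math

def example_48(N, q, eps):
    """Return (GEO(1; w, P), f(eps, N)) for w = (1/2, 1/2) and N >= 3 letters."""
    rest = (1.0 - q) * eps / (N - 2)        # probability of an "otherwise" letter
    big = (1.0 - q) * (1.0 - eps)           # probability of letter 2 (p1) or 3 (p2)
    p1 = [q, big] + [rest] * (N - 2)
    p2 = [q, rest, big] + [rest] * (N - 3)
    roots = [math.sqrt(a * b) for a, b in zip(p1, p2)]   # p1(y)^(1/2) * p2(y)^(1/2)
    geo_1 = roots[0] / sum(roots)
    f = 2.0 * math.sqrt(eps * (1.0 - eps) / (N - 2)) + eps * (N - 3) / (N - 2)
    return geo_1, f

# eps = 0.1 satisfies eps < (N - 2) / (N - 1)**2 = 2/9 for N = 4.
geo_1, f = example_48(N=4, q=0.5, eps=0.1)
closed_form = 0.5 / (0.5 + 0.5 * f)
```

For these values $f(\varepsilon, N) \approx 0.47 < 1$ and $\mathrm{GEO}(1; w, P) \approx 0.68$, which exceeds $q = 0.5$, the largest probability any linear mixture of the two models can assign to the letter 1.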



Note that the bounds in Table 1, rows 3 and 4, only translate to PAQ7 if $\mathcal{W} = \mathcal{S}$. To
obtain bounds for other weight spaces $\mathcal{W}$ we only need to substitute the appropriate values
for $|\mathcal{W}|$ and/or $c > 0$ where $\ell(x, \mathrm{GEO}(w, P)) \le c$ in the previous proof. E.g., if we have
$-r \cdot \mathbf{1} \le w \le r \cdot \mathbf{1}$ for $r > 0$ then the penalization term of the bound in row 3 increases
by a factor of $|\mathcal{W}|^2 / |\mathcal{S}|^2 = m r^2$.

Veness [8] gave a bound for linear mixtures using a non-OGD weight estimation scheme
which is identical to Table 1, row 2, except for the penalty term, which is $O(s \log n)$ in place
of $O(s\sqrt{n})$. However, our analysis is based on Theorem 3.7, which applies to the strictly
larger class of nice mixtures with a generic scheme for weight estimation. Clearly, more
restrictions can pay off in tighter bounds; consequently we might obtain better bounds by
taking advantage of the peculiarities of LIN and GEO.

5 Conclusion 

In this work we obtained code length guarantees for a particular mixture-based adaptive 
statistical data compression algorithm. The algorithm of interest combines multiple model 
distributions via a mixture and employs OGD to adjust the mixture parameters (typically 
model weights). As a cornerstone we introduced the class of nice mixtures and gave bounds on
their code length in the aforementioned algorithm. Since, as we showed, linear and geometric 
mixtures are nice mixtures we were able to deduce code length guarantees for these two 
mixtures in the above data compression algorithm. Our results on geometric mixtures directly 
apply to PAQ7, a special case of geometric mixtures, and provide the first analysis of PAQ7. 

We defer an exhaustive experimental study on linear and geometric mixtures to future 
research. A straightforward extension to Theorem 3.7, Item 2, is to remove the dependence
of the step size on the sequence length (which is typically not known in advance). This can 
be accomplished by using the "doubling-trick" [2] or a decreasing step size [12]. Another 
interesting topic is whether geometric and/or linear mixtures have disjoint properties, which 
we can use to yield stronger bounds. This opposes our current approach, which we built 
on the (common) properties of a nice mixture. 

Acknowledgement. The author would like to thank Martin Dietzfelbinger, Michael Rink, 
Sascha Grau and the anonymous reviewers for valuable improvements to this work. 
A Proof of Lemma 4.3 and Lemma 4.4



Lemma 4.3. For $0 < z < 1$ the function $f(z) := \dfrac{-\ln z}{1 - z}$ satisfies $f(z) \ge 1$.

Proof. By the basic inequality $-\ln(z) \ge 1 - z$ the claim follows. □

Lemma 4.4. For $0 < a \le z \le 1 - a$ the function $f(z) := -z^2 \ln z$ satisfies $f(z) \ge f(a)$.

Proof. First, we examine the derivative $f'(z) = -z(1 + 2 \ln z)$ of $f$. Clearly, $f'(z) \ge 0$ for
$0 < z \le z_0 := 1/\sqrt{e}$ and $f'(z) < 0$ for $z_0 < z < 1$. From $a \le 1 - a$ we conclude that
$a \le \frac{1}{2}$. We have $f(z) \ge \min\{f(a), f(1 - a)\}$ (by monotonicity) and it remains to show that
$f(a) \le f(1 - a)$. Let $g(a) := f(a)/f(1 - a)$ and observe that $g(a)$ increases monotonically
for $0 < a \le \frac{1}{2}$, i.e. $g(a) \le g(\frac{1}{2}) = 1$. Finally we argue that $g'(a) \ge 0$ where
$$g'(a) = \frac{a \ln(a) \ln(1 - a)}{(a - 1)^3 \ln^2(1 - a)} \cdot \left[ \frac{1 - a}{-\ln a} + \frac{a}{-\ln(1 - a)} - 2 \right].$$
Clearly, the left factor is negative for $0 < a < \frac{1}{2}$. The rightmost factor is at most 0, since
by Lemma 4.3 we have
$$\frac{1 - a}{-\ln a} \le 1 \Big/ \inf_{0 < z < 1} \frac{-\ln z}{1 - z} \le 1 \quad \text{and} \quad \frac{a}{-\ln(1 - a)} \le 1 \Big/ \inf_{0 < z < 1} \frac{-\ln z}{1 - z} \le 1$$
(for the second bound we substituted $z = 1 - a$), which concludes the proof. □



